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Abstract 

Database management systems (DBMSs) carefully optimize 
complex multi-join queries to avoid expensive disk I/O. 
As servers today feature tens or hundreds of gigabytes of 
RAM, a significant fraction of many analytic databases be¬ 
comes memory-resident. Even after careful tuning for an 
in-memory environment, a linear disk I/O model such as 
the one implemented in PostgreSQL may make query re¬ 
sponse time predictions that are up to 2 x slower than the 
optimal multi-join query plan over memory-resident data. 
This paper introduces a memory I/O cost model to iden¬ 
tify good evaluation strategies for complex query plans with 
multiple hash-based equi-joins over memory-resident data. 
The proposed cost model is carefully validated for accuracy 
using three different systems, including an Amazon EC2 
instance, to control for hardware-specific differences. Prior 
work in parallel query evaluation has advocated right-deep 
and bushy trees for multi-join queries due to their greater 
parallelization and pipelining potential. A surprising find¬ 
ing is that the conventional wisdom from shared-nothing 
disk-based systems does not directly apply to the modern 
shared-everything memory hierarchy. As corroborated by 
our model, the performance gap between the optimal left- 
deep and right-deep query plan can grow to about 10 x as 
the number of joins in the query increases. 

1. Introduction 

Ad-hoc analytical queries commonly perform a series of 
equi-joins between different tables. A database management 
system needs to accurately forecast the response time of dif- 
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ferent join evaluation orders to provide timely answers to 
these queries. A major component of this prediction is an 
estimation of the cost to access the data. These I/O cost mod¬ 
els have traditionally assumed that all data reside in a single 
directly-attached hard disk. However, cloud infrastructure 
providers such as Amazon EC2 offer memory-optimized in¬ 
stance types with dozens of vCPUs and hundreds of giga¬ 
bytes of memory. In this environment, a significant fraction 
of the database fits in memory and must be processed in 
parallel by multiple CPU cores for timely responses to user 
queries. 

The database research community has been investigating 
query processing techniques for memory-resident datasets 
for more than three decades 32] and has developed de¬ 
tailed cost models for single-core CPUs with deep cache 
hierarchies 0 M- The commoditization of multi-core, 
multi-socket systems have rekindled research interest in 
query execution j 1_2 , ^ a nd efficient processing techniques 


for modern hardware ll35L l45h . In-memory hash-based join 


^orithms, in particular, have received significant attention 
15l Isi l33r . The ramifications of ample single-node paral¬ 
lelism for multi-join query evaluation over in-memory data, 
however, are not as well understood. 

As a consequence, systems implementers find themselves 
in the dark regarding how to best synthesize query plans 
for complex multi-join queries that efficiently use all CPU 
cores, and turn to prior research in parallel query evalua¬ 
tion for parallel databases with a shared-nothing architec¬ 
ture. A multi-join query is commonly evaluated as a series 
of hash-joins between two tables. Eor every hash join, the 
query optimizer chooses which input will be stored in the 
hash table (henceforth referred to as the build or left side) 
and which input will be used to probe the hash table (hence¬ 
forth referred to as the probe or right side). Prior work in 
parallel hash-based multi-join query evaluation has advo¬ 
cated right-deep query trees, because all build (“left”) sub¬ 
trees can be processed in parallel and then probed in a single 
long pipeline liisl] . This conventional wisdom on good query 
plans is commonly encoded in ad-hoc heuristics in paral- 
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lei query optimizers to avoid the combinatorial explosion 
of the search space S. Although a single database server 
may have as many CPU cores as a parallel database in the 
past, the hardware architecture does not otherwise resem¬ 
ble a shared-nothing environment. Existing query optimiz¬ 
ers need to be augmented to accurately model the resource 
needs of ad-hoc queries over in-memory data. 

This paper focuses on hash-based join evaluation strate¬ 
gies for main-memory databases and contributes a detailed 
cost model that accurately predicts the query response time 
of ad-hoc multi-join queries. Our thesis is that the response 
time for processing a complex multi-join query using all 
CPU cores is proportional to the number of memory oper¬ 
ations weighted by the performance cost for different access 
patterns. By exhaustively exploring all possible query plans 
for a join between 4 tables, we find that the memory access 
cost that is predicted by our model proves to be an accu¬ 
rate proxy for elapsed time. We then demonstrate that lin¬ 
ear disk I/O models, such as the current cost model in Post- 
greSQL, cannot accurately account for the access patterns to 
and from main memory even after optimally adjusting the 
model weights through linear regression. 

The proposed cost model is conceptually simple as it does 
not account for the multi-layered cache hierarchy or any 
NUMA effects. We hnd that the loss of accuracy due to these 
two factors is minor for the cache-oblivious non-partitioned 
hash join algorithm that we model. Yet by only focusing on 
the top layer of the memory hierarchy, our model gains in 
versatility and can readily adapt to the underlying hardware. 
The conceptual simplicity of the model is particularly im¬ 
portant in light of the limited topological information about 
the CPU and cache hierarchy that can be gleaned from a VM 
in a cloud environment. Experiments with on-premises sys¬ 
tems and Amazon EC2 show that our model accurately pre¬ 
dicts how the response time of individual query plans will 
change based on the performance characteristics of different 
systems. 

A surprising finding from our thorough experimental 
evaluation is that some left-deep query trees can be more 
efficient than their bushy and right-deep tree counterparts 
if the join pipeline must run until completion because they 
result in substantially less memory traffic during execution. 
As corroborated by our model, the performance gap between 
the optimal left-deep and right-deep query tree can grow to 
about lOx as the number of joins in the query increases. 
This finding challenges the widely held belief among sys¬ 
tems implementers that MPP-style optimization is “good 
enough” for parallel in-memory query processing. 

2. Background and Related Work 

Query processing techniques for in-memory data 

Research in main memory database management systems 
commenced more than three decades ago. Early work in¬ 
vestigated join algorithms for main memory database sys¬ 


tems by DeWitt et al. 0 and an investigation of in-memory 
query processing techniques by Lehman and Carey 13211 . As 
the CPU architecture evolved, additional hardware opera¬ 
tions were shown to become performance bottlenecks. Ail- 
amaki et al. (12] studied the performance bottlenecks of four 
commercial database management systems and highlighted 
the importance of minimizing L2 data cache stalls and LI 
instruction cache stalls. Shatdal et al. lEi proposed to make 
the join algorithm cache-conscious to improve locality and 
join performance W adding a partitioning step before the 
join. Boncz et al. |LlO|] proposed radix-based query process¬ 
ing techniques to improve in-memory performance by re¬ 
ducing cache and TLB misses. 

As multi-core CPU architectures became the norm, the 
database research community explored alternatives to the 
established query processing paradigm. Harizopoulos et 
al. iS proposed a pipelined model for query processing 
where operators map to hardware cores and can be shared 
among multiple queries. Arumugam et al. 10] developed a 
push-based dataflow engine for query processing. Krikellas 
et al. 10 investigated techniques to dynamically generate 
code for query execution. Neumann 1431] proposed to fur¬ 
ther improve performance by compiling queries into native 
machine code using LLVM. Giannikis et al. il designed a 
system to process thousands of concurrent queries by shar¬ 
ing computation and intermediate results. 

There is also a lively and ongoing debate on efficient 
parallel in-memory equi-join techniques. Balkesen et al. (Islj 
compare sort-merge and radix-hash joins in a multi-core, 
main-memory environment and claimed that radix-hash join 
was superior to sort-merge join. Leis et al. ll^ break input 
data into small fragments and assign each to a thread to run 
entire operator pipelines. By controlling the dispatch of data 
fragments, query execution can be parallelized dynamically 
in a NUMA-aware fashion. Polychroniou and Ross iS pro¬ 
posed a family of main-memory partitioning algorithms that 
includes hash, radix and range partitioning. Li et al. OSll 
studied data shuffling in a NUMA system and improved the 
performance by employing thread binding, NUMA-aware 
thread allocation and relaxed global coordination. Lang et 
al. 03 in optimized the hash join for NUMA architectures 
by optimizing the physical representation of the hash table. 
Barber et al. ||2] introduced concise hash tables which sig- 
nihcantly reduce memory consumption while still offering 
competitive performance. 

Researchers also have proposed cost models for in¬ 
memory data processing. Manegold et al. ill] proposed a 
hierarchical model to capture all levels of a cache hierarchy 
and show that this can accurately capture the total response 
time of quick sort, hash join and partitioning operations after 
calibrating for the CPU cost of each algorithm. 


Parallel query evaluation strategies 

Multi-join query plan optimization is a difficult problem be¬ 
cause of the uncertainty over the cardinality of intermedi- 
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ate results and the large optimization space. Chaudhuri sum¬ 
marized the foundations and significant research findings of 
query optimization lll3|] . and Lohman has recently identified 
open problems in query optimization 0. 

Prior research has extensively studied parallel and dis¬ 
tributed query optimization techniques. R* was an early re¬ 
search prototype that optimized queries to minimize data 
transfer costs in a shared-nothing environment S. Chen 
et al. lIT^ proposed to generate parallel query plans in two 
phases: first generate the optimal query plan for a single 
node, and then to parallelize it for a multi-processor en¬ 
vironment. Hong and Stonebraker lEsIl showed that two 
phase optimization produces good query plans for shared- 
memory systems. Researchers have also studied parallel 
hash join algorithms. DeWitt and Gerber in introduced 
multi-processor versions of popular hash join algorithms 
for a single join and evaluated their advantages and disad- 
vantages in a multi-processor environment. Recently, Chu et 
al. ilTT propose a novel distributed multi-way join algorithm 
that has been theoretically proven to have better performance 
than evaluating the multi-way join as a series of binary joins. 

The performance of different multi-join evaluation strate¬ 
gies has also been extensively studied. Schneider and De- 
Witt ll4^ studied multi-way join query processing through 
hash-based methods in the Gamma parallel database system. 
They observed that trees with shallow hash-join build phases 
and a long probe pipeline (“right-deep” trees) can provide 
significant performance advantages in large multiprocessor 
database machines. The key advantages of right-deep query 
trees are the ample parallelization potential, as each build 
subtree can be executed concurrently, and that no intermedi¬ 
ate results need to be materialized in the long probe pipeline 
because results are streamed from one operator to the next. 

Right-deep trees introduce the problem of processor allo¬ 
cation to each join sub-tree during parallel evaluation, and 
Lu et al. 1^ use a greedy algorithm to generate query 
plans for multi-join queries that determines the order of 
each join, but also the number of parallel joins and the pro¬ 
cessor assignment to each join. loannidis and Kang ll29n 
show that optimization for bushy and left-deep trees is eas¬ 
ier than optimizing left-deep trees alone. Chen et al. Midi] 
proposed bushy trees with right-deep pipelines, which they 
call segmented right-deep trees, for efficient execution of 
multi-join queries. Query evaluation strategies for shared- 
nothing main-memory database management systems have 
been examined in the context of the PRISMA/DB system 
J^ . Wilschut and Apers have identified synchronization 
costs as the performance bottleneck from parallelism EH]. 
Recently, Giceva et al. Q have studied how to allocate re¬ 
sources and deploy query plans on multi-cores in the context 
of their operator-centric SharedDB system. 

Liu and Rundensteiner 11361] introduced segmented bushy 
query trees for the efficient evaluation of multi-join queries. 
Pipeline delays have also been studied in research by Desh- 
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Figure 1. The non-partitioned hash join algorithm. 


pande and Hellerstein illSIl who propose to reduce pipeline 
delays by plan interleaving between plan fragments. Ahmed 
et al. HI] have proposed a technique to optimize bushy join 
trees for snowstorm queries in the Oracle database system. 

Machine learning techniques have also been explored to 
predict the response time of a query. IBM’s DB2 introduced 
LEO, a learning optimizer, to adjust the queiy optimizer’s 
prediction based on the observed query cost Zhang 

et al. II 54 II proposed a statistical learning approach, COMET, 
to adapt to changing workloads for XML operations. Dug¬ 
gan et al. 0 used machine learning techniques to predict 
the performance of concurrent workloads. Akdere et al. |3] 
introduced two learning-based models, a plan-level model 
and an operator-level model, and contributed a hybrid model 
to predict query performance. Li et al. H combined re¬ 
gression tree models and scaling functions to estimate the 
resource consumption of individual operators. 


The non-partitioned in-memory hash join algorithm 

Our proposed memory I/O model captures the fundamental 
memory access operations from the popular non-partitioned 
in-memory hash join algorithm and the data structure layout 
optimization by Balkesen et al. 0). The main data structure 
is a hash table that is shared by all threads (see Eigure [T]). 
The hash table is a contiguous array of buckets. Each bucket 
contains metadata and the payload. Metadata are header 
information and a pointer to overflow buckets if this bucket 
has spilled. The hash join is divided into two phases: the 
build phase and probe phase. During the build phase, an 
empty hash table is created and the build table R is divided 
into equi-sized parts which are assigned to different threads. 
All threads then scan their input sequentially, hash the join 
key of every tuple, lock the bucket, write the tuple into the 
bucket, and then unlock the bucket. The probe phase starts 
after the entire build table R has been consumed. In the probe 
phase, all threads sequentially scan the probe table S, hash 
the join key of every tuple, and then join with the tuple in 
the corresponding bucket if the join keys are equal. 


3. Modeling memory I/O 

A fundamental challenge in building memory I/O models 
is the pull-based nature of data movement throughout the 
memory hierarchy. In stark contrast to the disk I/O hierar¬ 
chy where data movement is explicitly initiated by the ap¬ 
plication, application control of memory I/O is limited to 
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Mnemonic Access pattern 

SR Read one cache line sequentially 
RR Read one random cache line 
SW Write one cache line sequentially 
RW Write one random cache line 

Table 1. Access patterns captured by the model. 

prefetch, flush and non-temporal access hints. Complicating 
matters further, memory loads and stores are not only trig¬ 
gered when retrieving user data for query processing, but 
also when manipulating local variables, accessing the call 
stack and co-ordinating with other CPU cores. 

Determining the components of a memory I/O cost model 
is therefore a delicate balancing act. It is equally important 
to include sufficient detail to make accurate predictions but 
also avoid adding too many parameters that may exaggerate 
the effect of minor fluctuations in the model inputs because 
of over-fitting. The latter is particularly important for mem¬ 
ory I/O cost models that need to be generic to avoid obsoles¬ 
cence in light of the quick pace of innovation in hardware. 

When constructing performance models for general algo¬ 
rithms it is important to model every level of the cache hier¬ 
archy. The non-partitioned hash join algorithm, however, is 
cache-oblivious and memory-bound. Therefore, our thesis is 
that when modeling a sequence of non-partitioned hash joins 
the query response time is dominated by the cost of access¬ 
ing main memory and not individual caches. 

Not all access patterns to memory are equally expen¬ 
sive, however. Sequential access patterns are more efficient 
because they leverage prefetching and result in few TLB 
misses. Similarly, writes are more expensive than reads be¬ 
cause they require exclusive access to the cache line under 
the MESIF cache coherence protocol, which may require 
cache line invalidations in other CPU cores. (We outline the 
four access patterns in Table [T]) We thus model the response 
time for processing query Q as being proportional to the 
number of operations N{k) weighted by the performance 
cost wa for each access pattern A G {SR, RR,SW, RW}: 

Time{Q) oc wsr WjSR)-f wrr • A(RR) 

-f wsw ■N{S\N) -f wrw ■N{R\N) 

The performance cost wa for each access pattern A is ob¬ 
tained experimentally for each system through a controlled 
microbenchmark procedure, which is described in detail in 
Section 13.11 The number of operations N{-) is calculated 
from our model in Sections 13.21 and 13.31 Section 13.41 sum¬ 
marizes key insights obtained by our model. 

Modeling assumptions and limitations 

Our model assumes that memory is shared by all process¬ 
ing cores, it is automatically kept coherent by the hardware 
and that data transfer operations are performed at the granu¬ 
larity of a cache line. We also assume that the database and 


the working memory for a query can fit in memory and no 
paging to secondary storage occurs. We only consider multi¬ 
join queries that are evaluated in parallel from all threads 
using the non-partitioned hash-join algorithm by Balkesen 
et al. 10, which has been shown to offer competitive perfor¬ 
mance while judiciously using memory 0. 

The fact that our proposed model does not account for any 
caching or NUMA effects does not imply that architecture- 
awareness is not important. On the contrary, the memory I/O 
cost model is an abstraction of a well-designed query engine 
that is not compute-bound and will not trigger redundant 
memory I/O during query processing. Inefficient implemen¬ 
tations of the non-partitioned hash join will result in higher 
memory traffic, which will not be captured by the model. 
In particular, we assume that the hash table is striped on all 
available NUMA nodes. 

Finally, a limitation of join cost modeling is the assump¬ 
tion that the cardinality of intermediate join results is known 
or can be estimated reasonably accurately. We acknowledge 
that cardinality estimation is not a solved problem . Inac¬ 
curacies in estimating the cardinality or the skew of the build 
side are particularly problematic. Improperly sized hash ta¬ 
bles will result in the creation and probing of overflow hash 
buckets. Some DBMSs will track excessive bucket overflows 
and pause the hash table build process to resize the hash table 
and accommodate more tuples. New join algorithms have re¬ 
cently been proposed to avoid populating the hash table with 
heavily-skewed data J^ . Bucket overflows and hash table 
resizes are not accounted for by the proposed model. 

3.1 Bootstrapping the model 

The model depends on the performance cost wa for each 
access pattern A G {SR, RR, SW, RW} for accurate predic¬ 
tions. These general performance costs encapsulate (1) dif¬ 
ferences in the underlying hardware of each system such 
as different CPU models or different memory modules, (2) 
some hardware configuration options such as memory tim¬ 
ings and NUMA settings that can be changed in the BIOS, 
and (3) OS-specific configuration options, such as the num¬ 
ber and size of huge TLB pages. Bootstrapping consists of 
four specific microbenchmarks where all threads transfer 
data from memory such that only one specific access pat¬ 
tern is triggered at a time. The microbenchmarks do not per¬ 
form any computation, so the derived performance costs can 
be thought as a “worst-case” scenario where all threads are 
waiting on memory I/O. 

To derive the performance cost for the SR and SW pat¬ 
terns each thread allocates a NUMA-local contiguous array 
and populates it with random values. Each array is sized to 
a power of two such that as much of the available memory 
in the system is used, but no paging to secondary storage 
occurs. Every thread then concurrently reads its local array 
sequentially and either accumulates an 8-byte value if the ac¬ 
cess pattern is SR, or accumulates and writes an 8-byte value 
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Symbol 

Meaning 

A 

Access pattern, one of SR, RR, SW, RW. 

WA 

Performance cost for access pattern A 

A(A) 

Number of operations of access pattern A 

CL 

Cache line size 

1^1 

Cardinality of relation R 

W{R) 

Size of each tuple in relation R 

Br 

Hash table on R 

\Br\ 

Number of hash buckets in Br 

W(Bff) 

Hash bucket size in Br 

T 

Tuples per bucket (ie. hash table load factor) 

BH 

Size of metadata per hash bucket 

Qm,k 

Query subtree k at depth m 

^m,k 

Intermediate output of (cf. Figure[3]l 


Table 2. Notation. 


back if the access pattern is SW. (Notice that the SW access 
pattern includes an explicit memory read.) 

The RR and RW microbenchmarks use an array of the 
same size as for SR and SW that is now shared between 
all threads. Before the benchmarking, each thread first gen¬ 
erates a random sequence of cache-aligned array offsets in 
NUMA-local memory and waits on a barrier. The bench¬ 
marking then starts, each thread accesses an entire cache line 
at every offset in the randomly generated sequence. If the ac¬ 
cess pattern is RR, every thread only accumulates values in 
an 8 -byte local variable while reading the cache line. If the 
access pattern is RW, every thread reads and writes the entire 
cache line in 8 -byte units. 

The performance cost wa is calculated by dividing the ex¬ 
ecution time of the A G {SR, RR, SW, RW} microbenchmark 
with the number of cache lines that were accessed. 

3.2 Modeling individual operators 

In this section we analyze the memory access cost of indi¬ 
vidual operations. The notation is summarized in Tabled 

Sequential scan 

All threads read data sequentially from memory during the 
scan. Scanning entails no random reads or memory writes. 
For a relation R, the number of SR memory operations is 
determined by the size of the input relation and cache line 
size. Therefore: 


^scan(R) (SR) 


1^1 ■W(^) 

CL 


( 2 ) 


(RR) = (SW) = (RW) = 0 (3) 

Hash join build phase 

During the hash join build phase every thread concurrently 
reads tuples from the build relation, hashes the tuple on the 
join key and inserts it into a shared hash table. Hashing the 



Figure 2. Modeling the cache line (c;) writes in a hash 
bucket, when inserting tuples r, . 


join key only involves computation, thus there will be no 
memory accesses. As a good hash function is designed to 
assign different keys into different hash buckets, the hash 
buckets will be accessed in random order. Once the appropri¬ 
ate hash bucket has been identified, the thread needs to latch 
the bucket before writing into it to prevent insertion races 
with other threads. The latch and other metadata informa¬ 
tion are stored in a bucket header structure at the beginning 
of the bucket and have size BH. Acquiring the latch will re¬ 
sult in a local modification to the cache line. In the absence 
of any latch contention, releasing the latch will modify the 
same cache line again. Since the latch has likely already been 
brought into the cache after the latch acquire operation, the 
latch release operation will hit the same cache line and will 
not trigger additional memory accesses. Therefore one RW 
memory accesses will occur per tuple in the build relation R: 


■^faH!W(«)(RR) — ■^ 6 m'W(ff)(SR) —0 (4) 


■^fourW(/t)(RW) — |f?| 


(5) 


Writing each R tuple in the hash bucket will require 
additional writes. The number of memory writes depends on 
how many cache lines overlap with the modified memory 
region and the write pattern. Let us consider the insertion of 
T tuples from relation R (denoted by ri, r 2 ,...) into the hash 
bucket that consists of the cache lines ci, C 2 ,... that is shown 
in Figure 12] When the first tuple ri is inserted it fits entirely 
into Cl, causing one memory write. When ri is inserted, it 
spans both the ci and the ci cache lines and requires two 
memory writes. Similarly, both and trigger two memory 
writes each because they span cache lines, but inserting rs 
only requires one memory write as it fits completely within 
the C 4 cache line. 

We can derive a closed form solution for the memory 
writes that are triggered by inserting T tuples of width 
W {R) into cache lines of size CL by calculating the bound¬ 
aries of the cache lines and the tuples, which are shown 
as vertical dashed lines in Figure |2| We apply the inclu¬ 
sion/exclusion principle and first add the tuple boundaries 
{k\) and cache boundaries (^ 2 ). We will then subtract the 
boundaries that are both tuple and cache line boundaries 
(^ 3 ), such as the boundary between and rg in Figure |2| 
The ri,..., rj- tuples have ki = T — \ boundaries and the 


cache lines have k 2 = 


T-W(R)- 


CL 




boundaries. The common 


boundaries can be found in the locations that are both a 
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multiple of W{R) and a multiple of CL. If lcm{a,b) is the 
least common multiple of a and b, these locations are pre¬ 
cisely at 1 X Icm{W(R) , CL), 2 x Icm{W[R) , CL), .... Thus, 

the number of common boundaries is = — t-'w(r)-\ ^ 

lcm(W(R),CL 

We conclude that the number of memory accesses to in¬ 
sert T tuples of size W{R) into one hash bucket will be 
.j^{T,W{R)) = k\-\-k 2 — k^ + l, which is: 


.^{T,W{R)) = 


T-W{R)-l 


T-W{R)-1 

CL 


lcm{W{R),CL) 


(6) 


If the hash table B]^ on relation R has |Bff| hash buckets 
and each bucket is sized to fit [I^I/|Br|] tuples, the total num¬ 
ber of SW memory accesses is: 


A^ 6 „,-m(«)(SW) = |B«|x^ 




,WiR) (7) 
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Figure 3. Left-deep tree, bushy tree and right-deep tree. The 
output of a join is pipelined into the next operator. The black 
dot denotes the build side, that is the join input that will be 
buffered in a hash table before processing the probe side. 

fit in the cache and not generate any memory access when 
populating the output buffer in a well-tuned system. Our 
model of the probe phase of the hash join therefore includes 
no SW or RW operations: 

NprobeiS) (S W) = Np,,i,e{S) ( R W) = 0 (11) 


Hash join probe phase 

After the build phase completes, each thread will read tuples 
from the probe relation S, hashes each tuple on the join 
key and compares the join key of the tuple from S with the 
join keys of all the tuples in the corresponding hash bucket. 
No memory access is involved when evaluating the hash 
function. The join key will then need to be compared with 
every key in the hash bucket. (If the hash bucket is empty, 
the hash bucket metadata will still need to be retrieved to 
discover this.) As the bucket is determined by the hash of 
the join key of the input tuple, the first access to the hash 
bucket is RR for every tuple in the probing relation S: 


■^profce(S)(RR) — l‘^l (8) 

All tuples in the hash bucket will be retrieved from mem¬ 
ory for the join key comparison using the SR access pattern. 
The number of sequential reads is determined by the number 
of additional cache lines the bucket occupies. Each bucket in 
the hash table Bp will contain W{R) ■ T +BH bytes of data, 
but the first CL bytes have already been accounted for in 
the RR access pattern. Assuming that each hash bucket fits 
T = tuples, let be the number of SR memory 

accesses per probe: 


■^{R) 


W{R)- 



CL 


+ BH 


-1 


(9) 


A^p™MS)(SR) = IXxJS(’(/?) (10) 

The output of the join will be written to an output buffer 
of each thread so as to be used by the next operator. Since the 
queue buffer is reused between operators, it can be sized to 


3.3 Modeling multiple joins in a query tree 

The number of memory accesses for the subtree rooted 
at depth i and position j (see the bushy tree in Figure [^l 
is obtained by adding the memory accesses of the oper¬ 
ator at depth i and position j with the memory accesses 
of the left and right subtrees for each access pattern A G 
{SR, RR,SW, RW}. If Lj denotes the intermediate result 
from the operator at depth i and position j, the number 
of memory accesses with pattern A for the entire query is 
A(A) =Nofi{A), where: 


{ NscaniR,]{^) ifseq. scan, 

Nbuild{Ijpi^2j) (^) ^probe{Ii+ipjpi)(.^) 

+Ni+i, 2 j{A) +N,+i. 2 ;+i(A) if hash join. 

( 12 ) 

Equation [12] does not account for the cost of material¬ 
izing the output lij. For intermediate results, we have ex¬ 
perimentally found that this is not a significant cost if the 
output buffer has been sized appropriately d — the buffer 
that stores the materialized output batch for the next operator 
stays cached and does not trigger memory accesses. (Note, 
however, that the cost of processing a larger intermediate re¬ 
sult is captured in the cardinality and width of Cj when ac¬ 
counting for the memory accesses of subsequent operations.) 
We validate this modeling choice in Section 14.2.21 The fi¬ 
nal join output Iq o is also materialized incrementally if it 
is consumed by an aggregation or a top-k operation. (This 
pattern, for instance, occurs in every query in the TPC-H de¬ 
cision support benchmark.) If the entire query output is to be 
materialized, the model can be trivially extended with addi- 
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tional SW accesses to reflect the materialization cost. This 
cost would be identical for all query plans. 

3.4 Insight on join evaluation orders 

This model can offer valuable insights into the memory I/O 
requirements of different query plans. We illustrate this by 
modeling the common case of a linear primary key-foreign 
key join between n + I relations. Let that the multi-join 
operation be /?o Rn-i Rn, where \Ri l<l^ 

for every i S [0,n — 1], The join sequence for a left-deep 
evaluation strategy Ql is: 



While the right-deep evaluation strategy Qr is: 

Qr=(^Ro>^(^Ri>^{--->^ {R„-i N))^ 

Let us consider the common case of a join that produces 
as many tuples as the probe side, as is commonly the case 
for primary key-foreign key joins. Closed formulas for the 
memory access cost of the right-deep query tree Qr that is 
shown in Figure [2 can be obtained by setting In = Rn, as 
follows: 


NiSR) = ^ 

i=0 


CL 


i=\ 

(13) 


n — \ 


/V(SW)= ^ ( \Br^\x^ 

i=0 


M 

\Br^ 


,W{R. 


(14) 


/V(RR) = £|/,| 

i=l 


n—\ 


N{RW) = ^ \Ri\ (15) 

!=0 


The cost for a left-deep tree Qr, with /„ = Rq as shown in 
Figure[2 is: 


N{SR) = ^ 

(=0 


CL 


i=\ 

(16) 


(=1 V 



(17) 


N{RR) = J^\R.\ 

i=i 


/V(RW) = ^|/,| (18) 

r=l 


memory writes. A^(RW) and N{S\N) are identical for the 
two query trees, because the cardinality of the k-th interme¬ 
diate output is |4| = \Rn-k \ for the left-deep strategy Qr, and 
|4| = \Rn \ for the right-deep strategy Qr. 

Insight 2: The left-deep {Qr) evaluation strategy per¬ 
forms n\Rn\ — L"=i \Ri\ fewer RR memory accesses than 
the right-deep {Qr) strategy. The RR accesses for the left- 
deep strategy are A4(RR) = L;Li \Ri\, while the RR accesses 
are Nr{RR) — n\Rn\ for the right-deep strategy. A^(SR) is 
identical in both strategies if .jSf(-) =0, namely if the tuples 
are thinner than the last-level cache line. This is commonly 
the case due to projection push down or columnar storage. 

3.5 Possible extensions to the model 

The proposed model accounts for wide tuples but assumes 
that the tuple width is fixed per input relation. This may 
not be accurate if some attributes represent variable length 
fields such as VARCHARs, BLOBs or nested structures like 
JSON objects. The model could crudely be applied by using 
the average tuple size, but accurate results would require 
a complete histogram of the attribute size per join key. A 
research challenge is to devise techniques to collect such 
metadata information efficiently. 

Another promising direction is to extend the model for 
distributed in-memory joins. We foresee two challenges in 
generalizing our model to distributed in-memory hash joins: 
modeling data placement and locality and modeling the cost 
of remote data transfers. Our model currently does not con¬ 
sider data placement. It seems unlikely that a model that does 
not account for locality can remain accurate when some ac¬ 
cesses become remote and thus an order of magnitude more 
expensive than what is currently modeled. Regarding model¬ 
ing the cost of remote data transfers, Barthels et al. have 
proposed a performance model to estimate the cost of trans¬ 
ferring data using RDMA primitives for join processing. 

Hash tables are widely used for storing data for in¬ 
memory caching or in-memory transaction processing. Mod¬ 
eling the cost of each access is a first step towards under¬ 
standing the performance characteristics of complex opera¬ 
tions (such as transactions) that consist of a series of reads 
and writes on a hash table. We speculate that the main lim¬ 
itation of our model when applied to ad-hoc operations is 
to account for the effect of concurrent writes and reads. 
A promising research direction is to construct contention- 
aware performance models for in-memory data processing. 


Cost per access pattern 


Vendor, CPU or instance type 

IVSR 

Wrr 

Wsw 

Wrw 

Intel, 2xE5-2695v2 

1.00 

3.79 

5.03 

6.25 

AMD, 2xOpteron 6172 

1.00 

6.44 

1.88 

8.42 

Amazon EC2, c4.4xlarge 

1.00 

6.81 

5.21 

13.86 


Insight I: The left-deep {Qr) and right-deep {Qr) eval¬ 
uation strategy perform the same number and type of 
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Table 3. Measured cost per access pattern for all systems. 
(See Section im for details on the bootstrapping process). 


















Predicted memory I/O Observed memory I/O Response time slowdown 


Load 

factor 

RW 

SW 

Total 

writes 

SR 

Total 

reads 

Writes 

Reads 

Pred. 

Observ. 

Error 

BuM(R) 

Build(7?) 

Scan(i?) 

Total 

Error 

Total 

Error 

1.0 

512 

0 

512 

128 

640 

505 

1.4% 

634 

0.9% 

— 

— 

— 

2.0 

1,024 

0 

1,024 

256 

1,280 

1,019 

0.5% 

1,277 

0.2% 

2.00X 

1.97x 

1.5% 

3.0 

1,536 

0 

1,536 

384 

1,920 

1,531 

0.3% 

1,921 

-0.1% 

3.00X 

2.94x 

2.0% 

4.0 

2,048 

512 

2,560 

512 

3,072 

2,555 

0.2% 

3,081 

-0.3% 

4.77 X 

4.77x 

0.0% 

5.0 

2,560 

1,024 

3,584 

640 

4,224 

3,580 

0.1% 

4,234 

-0.2% 

6.55X 

6.56x 

-0.2% 

6.0 

3,072 

1,536 

4,608 

768 

5,376 

4,605 

0.1% 

5,388 

-0.2% 

8.32X 

8.36x 

-0.5% 

7.0 

3,584 

2,048 

5,632 

896 

6,528 

5,629 

0.1% 

6,548 

-0.3% 

10.09X 

10.14X 

-0.5% 

8.0 

4,096 

2,560 

6,656 

1,024 

7,680 

6,656 

0.0% 

7,701 

-0.3% 

11.87X 

11.98X 

-0.9% 


Table 4. Model accuracy for different hash table load factors. I/O events in millions. 


Predicted memory I/O Observed I/O Response time slowdown 


|5| Join 

1^ selectivity 

|5N/;| 

(millions) 

RW 

RR 

SR 


Total 

Total 

Error 

Pred. 

Observ. 

Error 

BuM(R) 

Probe(5) 

Scan(i?) 

Scan(5) 

4x 

25% 

512 

512 

2,048 

128 

512 

3,200 

3,136 

2.0% 

2.15x 

2.04 X 

5.4% 

3x 

33% 

512 

512 

1,536 

128 

384 

2,560 

2,625 

-2.5% 

1.77x 

1.71x 

3.5% 

2x 

50% 

512 

512 

1,024 

128 

256 

1,920 

1,957 

-1.9% 

1.38x 

1.37x 

0.7% 

lx 

100% 

512 

512 

512 

128 

128 

1,280 

1,299 

-1.5% 

— 

— 

— 

2x 

100% 

1,024 

512 

1,024 

128 

256 

1,920 

1,876 

2.3% 

1.38X 

1.37x 

0.7% 

4x 

100% 

2,048 

512 

2,048 

128 

512 

3,200 

3,270 

-2.1% 

2.15X 

2.10X 

2.4% 

6x 

100% 

3,072 

512 

3,072 

128 

768 

4,480 

4,583 

-2.2% 

2.92x 

2.84x 

2.8% 

8x 

100% 

4,096 

512 

4,096 

128 

1,024 

5,760 

5,907 

-2.5% 

3.68X 

3.56x 

3.4% 


Table 5. Model accuracy for larger probe relations and different join selectivities. I/O events in millions. 


4. Experimental evaluation 

This section describes the experimental evaluation and val¬ 
idation of our model. Sections 14.2.11 and 14.2.21 focus on a 
single join and evaluate the accuracy of the model in pre¬ 
dicting the memory I/O activity. We validate that I/O ac¬ 
tivity is linearly correlated with query response time, and 
we find that the impact of incrementally materializing inter¬ 
mediate results to the next pipeline stage is negligible. Sec¬ 
tion 14.2.31 evaluates how parallel subtree evaluation affects 
performance, and validates that sequential subtree evaluation 
results in similar query response times to the fastest parallel 
evaluation strategy but with less memory pressure. 

We then turn our attention to multi-join queries, and we 
exhaustively compare all possible multi-join query plans for 
a join with four input relations. Section l4.3.1l discusses how 
our simple memory I/O model compares with a linear disk- 
based I/O model, such as the Haas et al. model lE^I . and 
whether parameter tuning of disk-based models is sufficient 
to predict the response time of queries on memory-resident 
data. Section |432] highlights how our model adapts to dif¬ 
ferent hardware and can accurately predict response time 
“drifts” of particular query plans as the memory hierarchy 
evolves. The remainder of the section evaluates the sensi¬ 
tivity of our model to changes in the database size (Sec- 
tion l4.3.3l l. join skew (Section [4. 3. 4b . input cardinality ratio 


(Section l4.3.5b and join selectivity (Section [4.3.6b . Finally 
Section l4.3.7l evaluates the predictive power of our model in 
forecasting the cost of deeper multi-join pipelines. 

We have extended Pythia 0, a prototype open-source 
in-memory query engine to support multi-way joins and 
complex query pipelines. We started with the optimized non- 
partitioned hash join implementation of Balkesen et al. ||3] 
to create a hash join operator that can be placed in com¬ 
plex query pipelines using a parallel, vectorized, pull-based 
execution model HU 13]. Building on prior work lf3 . we 
prefetch hash buckets before they are accessed by “peek¬ 
ing ahead” within the batch of input tuples. We have built 
a custom memory allocator to use huge pages for memory 
requests larger than 1 GB. By tightly controlling memory 
allocation we can enforce NUMA placement, support pre¬ 
allocation during query planning time, and precisely monitor 
the memory consumption of a query during processing. 

4.1 Hardware setup and methodology 

We use three systems for our experimental evaluation. The 
first system (which we will refer to as “Intel”) is a server with 
two NUMA nodes and 256 GB of DDR3 memory, with two 
Intel Xeon E5-2695v2 12-core processors. Each core in an 
E5-2695v2 processor shares the 30 MB L3 cache. We have 
enabled hyper-threading, so the operating system can simul- 













































taneously execute 48 threads. The second system (which we 
will refer to as “AMD”) is a server with four NUMA nodes 
and 32 GB of memory, with two AMD Opteron 6172 12- 
core processors (in total, 24 concurrent threads). Each core 
in an Opteron 6172 processor shares the 12 MB L3 cache. 

In a cloud environment, the precise hardware specifi¬ 
cations may not be known before the VM is instantiated. 
An advantage of the proposed cache-oblivious model is its 
adaptability to hardware with unknown specifications. The 
third system (which we refer to as “EC2”) is an Amazon 
EC2 instance of type c4.4xlarge. The instance has 30 GB 
of memory and 16 vCPUs. To control for cyclical variations 
in performance, we bid for a single spot EC2 instance around 
10am every morning (US Eastern time) for 7 consecutive 
days. The observed day-to-day variation in performance is 
less than 2% so we only report averages over all seven days 
for all EC2 experiments. 

The measured performance cost wa for each access pat¬ 
tern A among SR, RR, SW and RW for the three systems is 
shown in Table |3] (Please refer to Section ITTI for details on 
the bootstrapping process.) We use the built-in “perf” tool 
of the Linux kernel to tap into hardware counters of perfor¬ 
mance events related to memory accesses. To minimize re¬ 
porting overheads from “perf” we only inspect these coun¬ 
ters after the completion of the initialization, the build and 
the probe phases (that is, three times per join operation). Ev¬ 
ery experiment is repeated at least ten times, and we report 
the average. We assume that a memory manager preallocates 
memory when the DBMS starts, and queries can quickly 
reuse pre-allocated memory. Thus, memory allocation time 
is not included in the reported query response time. 

4.2 Validating modeling assumptions 

This section validates key modeling assumptions using the 
Intel system that is described in Section 14.11 We first dis¬ 
sect a single hash join to evaluate how accurately we model 
memory I/O traffic and whether memory traffic is a good 
predictor of the join execution time. We then turn our atten¬ 
tion to how parallel evaluation of distinct build subtrees of a 
query plan affects performance. 

4.2.1 Hash join build phase 

The first experiment evaluates the accuracy of the model 
in capturing the memory I/O activity and predicting the 
completion time of the hash join build phase. We create one 
relation R with two attributes. The first attribute is an 8-byte 
integer which ranges from 1 to the cardinality of relation |/?|, 
and the second attribute is a random 8-byte integer. Narrow 
tuples are commonly encountered in column-stores and have 
been extensively used for experimental evaluation in prior 
work 10. The input relation has been randomly permuted so 
that the keys are in random order, and is striped among both 
NUMA nodes of our system. Eor this experiment we fix the 
number of hash buckets |B| to 512 million and use 16 bytes 


per bucket as the bucket header BH; the first cache line can 
therefore hold 3 tuples. 

The experimental results and the model predictions for 
building a hash table are shown in Table |4] We vary the car¬ 
dinality of the build relation R from 512 million (lx) to 
4 billion (8x) tuples. When the build relation size varies 
from lx to 3 X, all tuples will be in the same cache line with 
bucket header, and N{S\N) = 0. When the build relation size 
reaches or exceeds 4x, tuples will be inserted in the next 
available cache line and will cause SW traffic. When com¬ 
paring the model prediction with the observed I/O activity, 
we notice that the read and the write prediction is very accu¬ 
rate and commonly deviates less than 1% from the observed 
memory traffic. When comparing the model prediction with 
the experimentally observed execution time, we observe that 
the memory I/O activity is an accurate predictor of the exe¬ 
cution time of the build phase, as it commonly deviates less 
than 2% from the observed time. 

4.2.2 Hash join probe phase 

We now evaluate the accuracy of our model in capturing 
the hash join probe phase. We use the same build relation 
R as in Section 14.2.11 and fix the cardinality of R to 512 
million tuples (lx). As shown in Table |5] the cardinality of 
the probe relation |5| varies from 1 x 512M to 8 x 512M, 
and the probability that an S tuple finds a matching R tuple 
in the hash table varies from 25% to 100%. We observe 
that the model’s memory activity prediction is accurate and 
within 3% of the observed memory read traffic. Memory 
I/O activity is again linearly correlated with query response 
time. In fact, the model’s prediction slightly overpredicts 
the query response time. Einally, the cardinality of the join 
output |5 XI /?| does not noticeably affect the memory I/O 
activity and the response time, because the join output is 
materialized incrementally in the same output buffer for 
pipelined operations. This validates our design choice to 
not explicitly account for the output materialization cost of 
intermediate results. 

4.2.3 Parallel vs. sequential build subtree evaluation 

Schneider and DeWitt S argued that one aspect of supe¬ 
riority of right deep trees over left deep trees is their poten¬ 
tial for exploiting parallelism in their independent build sub¬ 
trees. In this section, we study the performance of utilizing 
this inter-subtree parallelism for memory-resident datasets. 
We create three relations that contain 1 billion tuples each, 
and evaluate the query plan shown in Eigure |4(a)| Every 
probe tuple matches one build tuple in the Ji and J 2 joins. 
We execute the query plan both sequentially and in parallel. 
When we execute the query sequentially we use all available 
threads to build R\ first, and then move on to build R 2 after 
completing the build of When we execute the query in 
parallel, the threads are divided into two groups: k threads 
build a hash table on Ri while the other 48 — k concurrently 
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(a) k threads build the Ji hash table, 
while the remaining threads build the J 2 
hash table. All threads participate in the 
probe pipeline. 
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(b) Response time as the num¬ 
ber of threads deviates from 
the optimal allocation. 


Figure 4. Exploring sequential or parallel build subtree 
evaluation for right-deep query trees. 

build a hash table on All threads participate in the probe 
phase for both evaluation strategies. 

The response time of the t'wo evaluation strategies is 
sho'wn in Figure |4(b)] on the vertical axis, as the number of 
threads assigned to Ri changes on the horizontal axis. Be¬ 
cause the t'wo relations have equal size, the optimal alloca¬ 
tion is A: = 24 which divides the threads into two equally- 
sized groups. Performance will degrade if the optimal thread 
assignment is not predicted correctly by the query optimizer 
■when processing more complex query subtrees. However, 
because the communication costs of a main-memory envi¬ 
ronment are negligible compared to shared-nothing architec¬ 
tures, the performance of building the hash tables on Ri and 
R 2 sequentially is equal to the performance of parallel eval¬ 
uation with the best thread allocation. 

An advantage of sequential evaluation is the memory con¬ 
sumption of the query. When building sequentially, 64GB of 
memory is allocated in the beginning, which corresponds to 
the memory requirement for building Ri. The memory con¬ 
sumption further increases by 64GB (the amount of mem¬ 
ory needed for building R 2 ) later in the query, and remains 
constant. When building in parallel, 128GB of memory is 
needed upfront to start the query, if both Ri and R 2 are to be 
built concurrently. Although the performance and the peak 
memory consumption is the same, the sequential evaluation 
strategy imposes less pressure to the memory manager and 
the DBMS because it gradually reaches its peak memory al¬ 
location. In light of these findings, we process independent 
build subtrees sequentially in the experiments that follow. 


4.3 Evaluation with multi-join query plans 

In order to exhaustively evaluate our model’s prediction for 
multi-join query plans, we generate a synthetic database with 
four relations Rq, Ri, R 2 and Rj. All relations contain two 
long integer attributes a and b, and the tuple size is 16 bytes. 
The attribute a ranges from 1 to the cardinality of the relation 


A A x A A 

A"’A"” f/Vd 

A R. Ro A /\ / \ ^ R3 

/ \ / \ 1^0, RiR? R 3 / \ / \ 

Rq Ri Ri R2 Ri R2 R2 1^3 

L0123 LB0123 bVi23' RB0123 R0123 


Figure 5. Query plan notation for the exhaustive 4-relation 
join experiment. The letters denote the tree shape, followed 
by the input relations when reading from left to right. 


and the values are randomly distributed. We evaluate the 
following SQL query: 


SELECT SUM(/?o-a + R^ .b) 

FROM Ro, Ri, R 2 , R 3 

WHERE Ro.b = Ri-a, Ri .b = R2.a,, R2.b = R^.a 


This synthetic database with 4 relations allows us to ex¬ 
plore the five query plan shapes which are shown in Figure|2 
namely the left-deep (L), left-bushy (LB), bushy (B), right- 
bushy (RB), and right-deep (R) tree. We use “TXXXX” to 
refer to a query plan, where “T” is the query type, followed 
by a sequence of numbers denoting the input relations when 
reading from left to right. By focusing on a database with 
4 relations, we can systematically explore the entire query 
plan space, while keeping the number of query plan can¬ 
didates within a reasonable limit for presentation purposes. 
(We relax this constraint in Section l4.3.7l where we evaluate 
the model with query plans that involve more than 3 joins.) 

4.3.1 Comparison with a popular disk-based model 

Do we need a new cost model for memory I/O, or is adjust¬ 
ing the weights of existing linear disk-based models suffi¬ 
cient? We answer this question by comparing the accuracy 
of our model with the PostgreSQL 9.3 disk I/O model which 
mimics the Haas et al. ll^ model: 

Cost{Q) ns-Cs + Ur-Cr (19) 

The number of I/O operations n* and for a given query 
plan are predicted by the query optimizer and is a function 
of the cardinalities of the join inputs and the join output. The 
cost weights c* and Cr are user-adjustable parameters. 

We exhaustively explore all viable multi-join query plans 
with 4 relations for the query that is shown in Section 14.31 
We set the relation cardinality of Rq, Ri, R2, R3 to 2 bil¬ 
lion, 512 million, 128 million and 32 million tuples respec¬ 
tively. The join ratio is 1:4, that is for every value of Rj^^i.a, 
there are 4 tuples with matching Rf^.b values. We force Post¬ 
greSQL to use a specific join order and explicitly set the car¬ 
dinality estimates to the actual join cardinality instead of re¬ 
lying on the prediction by the query optimizer. We obtain 
ris by setting w* to 1 and Wr to 0, and we report the total 
cost for the query plan, as shown by PostgreSQL’s EXPLAIN 
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PostgreSQL disk I/O model prediction (arbitrary units) 
(a) PostgreSQL disk I/O cost model, rp=0.740, r5=0.724 



10000 20000 30000 40000 60000 

Model prediction (arbitrary units) 

(b) Our model on the Intel system, rp=0.970, rj=0.982. 



3000 4000 5000 6000 7000 9000 4000 6000 8000 10000 14000 

Model prediction (arbitrary units) Model prediction (arbitrary units) 


(c) Our model on the AMD system, rp=0.881, r,=0.903. (d) Our model on the EC2 instance, rp=0.939, r5=0.989. 

Figure 6. Observed query response time vs. predicted cost from the PostgreSQL I/O cost model with least-squares fit, and our 
model for the Intel, AMD and EC2 systems, rp denotes the Pearson correlation coefficient, and the Spearman correlation 
coefficient. 
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command. (We do likewise to find n^..) We finally perform 
linear regression to find the optimal non-negative costs to 
minimize the error between the cost prediction and the ob¬ 
served execution time. 

Figure [6(a)| shows the prediction of the PostgreSQL disk 
I/O model on the horizontal axis (after parameter fitting 
using linear regression) and the observed response time on 
the Intel system on the vertical axis. Even with optimal cost 
weights, the predictions of the PostgreSQL disk I/O model 
are clustered in five points (shown as vertical bands) on 
Figure |6(a)| PostgreSQL gives the same cost prediction for 
plans with very different execution times; it predicts that the 
best two plans (“L3210” and “L2310”) will have the same 
response time as the “R0132” and “R0123” plans which 
are in reality twice as expensive. Making matters worse, 
“R3210” is the optimal right-deep query plan but is predicted 
to be twice as expensive as the “R0123” query plan. 

In comparison. Figure |6(b)| shows the prediction from 
our model on the Intel system, which successfully predicts 
the cheapest plans “L3210” and “L2310”. In addition, all 
observed response times lie within a narrow ±15% band 
of their prediction. A disk I/O model that only considers 
the size of the data accessed and whether this access is 
sequential or random cannot account for the access patterns 
that arise during an in-memory hash join. 

4.3.2 Model adaptability to different hardware 

In this section, we test the model’s accuracy on the different 
systems that were described in Section |4T| Because of lim¬ 
ited memory size, we set the cardinality of Rq, Ri, R 2 , R 3 to 
be 256 million, 64 million, 16 million, and 4 million tuples 
respectively. The join ratio is set to be 1 ;4 as in Section l4.3.11 
and the query is the 4-relation join described in Section |43] 
The predictions are more “noisy” due to caching effects that 
arise because of the smaller database size — for instance, R 3 
can be cached in its entirety in all systems. (We explore the 
model sensitivity to the database size in Section l4.3.3h 

Figure |6(c)| compares the model’s cost prediction with 
the observed query response time on the AMD system. Our 
model remains accurate as only a few of the 40 query plans 
fall outside the ±15% band. The proposed model also adapts 
to the hardware architecture of the underlying platform. Ob¬ 
serve the “L2130” and “B3210” query plans on the Intel 
system at Figure [6(b)] Both plans have been accurately pre¬ 
dicted to have comparable performance. Yet for the AMD 
system at Figure |6(c)| the model accurately predicts that 
the “B3210” query plan is 15% more expensive than the 
“L2130” plan. This difference can be attributed to the pro¬ 
portion of the RR activity of the two plans, which is 76% for 
the “B3210” plan but only 52% for the “L2130” plan. As the 
wrr weight changes from 3.79 on the Intel system to 6.44 
on the AMD system (cf. Table O, our model captures this 
performance difference and adjusts the cost predictions for 
all plans accordingly. 


We also test our model on the Amazon EC2 instance us¬ 
ing the same dataset as for the AMD system and show the 
results in Figure |6(d)] The EC2 results are similar to the re¬ 
sults from the Intel system, except for all the plans with pre¬ 
fixes “RBO” and “RO” that construct a hash table on Rq. Our 
model predicts that these plans are much faster than their ac¬ 
tual response time. A closer look at the performance coun¬ 
ters reveals that the discrepancy is due to the cost to translate 
virtual to physical memory addresses when accessing this 
large hash table. The Intel system was configured with 1GB 
huge pages that Pythia can readily use, while the EC2 in¬ 
stance had no huge page support. Modeling the hierarchical 
nature of the page table could improve the accuracy of the 
model when huge pages are not enabled or not supported. 

We now adopt more concise metrics to quantify model 
accuracy. Prior work ll25ll53n has used correlation metrics to 
measure the accuracy of the model prediction. The Pearson 
correlation coefficient rp is used to measure the linear corre¬ 
lation between two variables (namely, the prediction and the 
actual query response time). A deficiency of the Pearson co¬ 
efficient is that it can assign low coefficient scores to models 
that correctly predict the relative ordering of different query 
plans, but the correlation with the observed response time is 
non-linear. The Spearman correlation coefficient accom¬ 
modates non-linear models by measuring the linear corre¬ 
lation between the ranks of different query plans. For the 
remainder of the paper we will use both the Pearson (rp) and 
Spearman (r^) coefficients as metrics of model accuracy. 


4.3.3 Model sensitivity to the database size 

The proposed model assumes that caching affects all query 
plans equally. In this section we validate this design assump¬ 
tion by showing the accuracy of the model as a growing frac¬ 
tion of the database fits in the cache (that is, the database 
shrinks in size). We use the multi-join query in Section [431 
on the Intel system, and fix the join ratio to be 1:4. Fig¬ 
ure |7(a)| plots the Pearson and Spearman correlation coef¬ 
ficients as |Ro| ranges from 2 billion tuples to as little as 
8 million tuples. The data points for 2 billion tuples reflect 
the coefficients for Figure |6(b)| The dip for 1 billion tuples 
is caused by three outliers, namely the “LB2103”, “LI203” 
and “L2103” plans, whose response time is in the slowest 
quartile (25%) among all query plans. These very expensive 
query plans would realistically be never considered as viable 
alternatives by the query optimizer. The model remains very 
accurate with a Spearman coefficient of at least 0.95, until 
the database size becomes less than 1 GB (that is, when Rq 
has fewer than 32 million tuples). When [RqI is 8 million tu¬ 
ples, Ri, R 2 , and R 3 fit entirely in cache. Even then, both the 
Pearson and Spearman correlation coefficient for our model 
remain above 0.8 and 26 out of the 40 query plans have re¬ 
sponse times that are within ±15% of the model prediction. 
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Figure 7. Model sensitivity to changes in the database size, skew, and join selectivity for the Intel system. 


Cardinality ratio 


Coefficient 

1:1 

1:2 

1:4 

1:8 

Pearson 

undefined 

0.981 

0.970 

0.883 

Spearman 

undefined 

0.984 

0.982 

0.957 


Table 6. Model sensitivity to cardinality ratio. 


4.3.4 Model sensitivity to data skew 

Data skew has been shown to have a profound effect on per¬ 
formance even for a single join jil]. We now evaluate the 
effect of data skew on our model’s prediction. We use the 
multi-join query discussed in Section l43] and we fix the car¬ 
dinality of Rq, Ri, R 2 and R^ to be 2 billion, 512 million, 128 
million and 32 million tuples, respectively. Attribute Rj^.b 
ranges from 1 to |, and the probability for Rj^.b to refer¬ 
ence individual keys in Rk+i-a follows the Zipf distribution. 
We vary the Zipf factor from 0 (no skew) to 1.5. 

Figure |7(b)| plots the normalized query response time of 
query plans that exhibit probe-side data skew. Since our 
model doesn’t account for data skew, it produces the same 
prediction for all Zipf factors. The query plan response time 
remains stable when the Zipf factor is smaller than 0.5. 
When Zipf factor is 1, the most frequent value for Rq occurs 
more than 100 million times but the drop in query response 
time is less than 6%. More skew affects all query plans 
almost proportionally for Zipf factors as high as 1.5. Given 
that different query plans have up to 4 x different response 
times, the relative order between the query plans does not 
change: our model perfectly ranks all the queries that are 
shown in Figure |7(b)| (that is, = 1.0). 

4.3.5 Model sensitivity to input cardinality ratio 

We now evaluate the sensitivity of the model as the ratio of 
varies from 1 to |. When the ratio is 1:1, \Ro\ = |f?i| = 
1 ^ 2 ! = l^al and R^.b and Rk+i-a have a 1-to-l match. When 
the ratio is 1:8, = 8 x \Rk+i\ and Rk-b and Rk+i-a have 

a l-to-8 match. The correlation coefficients for different ra¬ 


tios are shown in Table |6] The coefficient for 1:1 is unde¬ 
fined because our model gives the same prediction to all 40 
query plans and results in a variance of zero. However, the 
model prediction for the 1:1 ratio is very accurate, as the ob¬ 
served response time for all 40 query plans varies less than 
±0.5%—well within the margin of statistical error. As the 
ratio changes, the Pearson coefficient stays above 0.88 and 
the Spearman coefficient stays above 0.95. The drop for the 
1:8 ratio is caused by three outliers (the “LB2103”, “L1203” 
and “L2103” plans) which are among the most expensive 
plans for evaluating this query and thus are not competitive 
evaluation strategies. 

4.3.6 Model sensitivity to join selectivity 

In this section we study the accuracy of the model when 
changing the selectivity of the join. We use the join query 
in Section I 43 ] The cardinalities of Rq, Ri, R2 and Rq are 2 
billion, 512 million, 128 million and 32 million respectively. 
The results are shown in Figure J7(c)| In the 1:8 dataset, the 
attribute Rk-b ranges from 1 to and every value occurs 
8 times, i.e. half of the tuples in Rk+i will have matching 
tuples in Rk and each Rk+i tuple will match 8 tuples in Rq. 
The other datasets are obtained likewise: for instance, in the 
1:1 dataset only '/t of the tuples in Rk will match in Rk+i, 
and each will match exactly once. As Figure [7(^ shows, the 
model remains accurate as the join selectivity changes and 
the Spearman coefficient stays above 0.9. 

4.3.7 Modeling longer join pipelines 

After focusing on the exhaustive exploration of the query 
plan space for a query with three joins, we now will explore 
how the performance of the best right-deep tree (“R3210”) 
and the best left-deep tree (“L3210”) changes as the number 
of joins increases, and whether our model can predict this 
accurately. We generate more relations with a join ratio of 
1:4 as in Section !?. 3. 11 and we grow each query tree accord¬ 
ingly. We plot the query response time when the number of 
joins increases for both the left-deep tree and the right-deep 
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Figure 8. Comparing the optimal left-deep and right-deep 
query plans as the number of joins increases. The perfor¬ 
mance gap grows to almost lOx, as predicted by the model. 

tree in Figure [8(a)l and contrast this with the result from our 
prediction in Figure |8(b)| The left-deep tree and the right- 
deep tree have the same performance when the number of 
joins is 1, as the trees are identical. As the number of joins 
increases, the performance of right-deep trees linearly wors¬ 
ens over the performance of the left-deep tree. The key draw¬ 
back of the right-deep tree is that by using the largest rela¬ 
tion as the probing relation the data of the largest relation 
“flows” through all pipeline stages and results in additional 
RR memory accesses in every stage. Our model accurately 
forecasts this performance trend for left-deep trees and right- 
deep trees as the number of joins increases. 

5. Concluding remarks 

A significant fraction of a database may fit entirely in main 
memory and can be analyzed in parallel using multiple CPU 
cores to keep query response times short. Accurately fore¬ 
casting the response time of different in-memory query plans 
is becoming important for query optimization in this envi¬ 
ronment. Towards this goal, we contribute a cost model that 
can accurately predict the query response time of ad-hoc 
query plans with multiple hash-based joins. 

A surprising insight from our model is that some left- 
deep query trees can be more efficient than their bushy and 
right-deep tree counterparts because they result in less mem¬ 
ory I/O during execution. Prior work in parallel hash-based 
multi-join query evaluation has advocated right-deep query 
trees because all build subtrees can be processed in parallel 
and then probed in a single pipeline 11481] . Our experimen¬ 
tal evaluation corroborates that in a main-memory setting 
evaluating query subtrees sequentially using all threads is as 
fast as evaluating separate query subtrees concurrently. This 
hnding suggests that evaluating one join at a time and storing 
the intermediate result in a hash table may be a viable query 
execution strategy over a memory-resident dataset as it can 
ameliorate the cascading effect of errors in join cardinality 
estimation. Query processing techniques that are adaptive to 
cardinality estimation errors are a promising research area 
for future work. 
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